Translation of Unknown Words in Low Resource Languages
Abstract
We address the problem of unknown words, also known as out-of-vocabulary (OOV) words, in machine translation of low-resource languages. Our technique combines several methods, each motivated by a commonly observed OOV type. We also design evaluation techniques for measuring the OOV coverage achieved and integrate the new translation candidates into a Statistical Machine Translation (SMT) system. Experimental results on Hindi and Uzbek show that our system achieves good coverage of OOV words: our methods produce correct candidates for 50% of Hindi OOVs and 30% of Uzbek OOVs, in scenarios that have 1 and 3 OOVs per sentence. This offers potential for improving translation quality for languages that have limited parallel data available for training.
Similar resources
Learning Translations for Tagged Words: Extending the Translation Lexicon of an ITG for Low Resource Languages
We tackle the challenge of learning part-of-speech classified translations as part of an inversion transduction grammar, by learning translations for English words with known part-of-speech tags, both from existing translation lexica and from parallel corpora. When translating from a low-resource language into English, we can expect to have rich resources for English, such as treebanks, and smal...
Phrase-Boundary Translation Model Using Shallow Syntactic Labels
The phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target-side phrases of the training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels, including POS tags and chunk labels. Giving priority to chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...
Sublexical Translations for Low-Resource Language
Machine Translation (MT) for low-resource languages has low-coverage issues due to Out-Of-Vocabulary (OOV) words. In this research we propose a method that uses sublexical translation to achieve wide coverage in Example-Based Machine Translation (EBMT) from English to Bangla. For sublexical translation, we divide the OOV words into sublexical units to obtain translation candidates. Previous ...
Bilingual Lexicon Induction for Low-resource Languages
Statistical machine translation relies on the availability of substantial amounts of human translated texts. Such bilingual resources are available for relatively few language pairs, which presents obstacles to applying current statistical translation models to low-resource languages. In this work, we induce bilingual dictionaries from more plentiful monolingual corpora using a diverse set of c...
Example-Based Machine Translation for Low-Resource Language Using Chunk-String Templates
Example-Based Machine Translation (EBMT) for low-resource languages, such as Bengali, has low-coverage issues due to the lack of parallel corpora. In this paper, we propose an EBMT system for low-resource languages that uses chunk-string templates (CSTs) and translates unknown words. CSTs consist of a chunk in the source language, a string in the target language, and word alignment information. CSTs are prepared aut...
Publication date: 2016